Abstract

We consider a general class of regularization methods which learn a vector of parameters on the 
basis of linear measurements. It is well known that if the regularizer is a nondecreasing func- 
tion of the inner product then the learned vector is a linear combination of the input data. This 
result, known as the representer theorem, is at the basis of kernel-based methods in machine 
learning. In this paper, we prove the necessity of the above condition, thereby completing the 
characterization of kernel methods based on regularization. We further extend our analysis to 
regularization methods which learn a matrix, a problem which is motivated by the application 
to multi-task learning. In this context, we study a more general representer theorem, which 
holds for a larger class of regularizers. We provide a necessary and sufficient condition for
this class of matrix regularizers and highlight them with some concrete examples of practical
importance. Our analysis uses basic principles from matrix theory, especially the useful notion 
of matrix nondecreasing function. 



1 Introduction 



Regularization in Hilbert spaces is an important methodology for learning from examples and has
a long history in a variety of fields. It has been studied, from different perspectives, in statistics
[Wahba, 1990], in optimal estimation [Micchelli and Rivlin, 1985] and recently has been a focus of
attention in machine learning theory - see, for example, [Cucker and Smale, 2001, De Vito et al.,
2004, Micchelli and Pontil, 2005a, Shawe-Taylor and Cristianini, 2004, Vapnik, 2000] and references
therein. Regularization is formulated as an optimization problem involving an error term
and a regularizer. The regularizer plays an important role, in that it favors solutions with certain
desirable properties. It has long been observed that certain regularizers exhibit an appealing
property, called the representer theorem, which states that there exists a solution of the regularization
problem that is a linear combination of the data [Wahba, 1990]. This property has important
computational implications in the context of regularization with positive semidefinite kernels, because
it reduces high- or infinite-dimensional problems of this type to finite-dimensional problems whose
size is the number of available data [Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini,
2004].



The topic of interest in this paper will be to determine the conditions under which representer 
theorems hold. In the first half of the paper, we describe a property which a regularizer should 
satisfy in order to give rise to a representer theorem. It turns out that this property has a simple 
geometric interpretation and that the regularizer can be equivalently expressed as a nondecreasing 
function of the Hilbert space norm. Thus, we show that this condition, which has already been 
known to be sufficient for representer theorems, is also necessary. In the second half of the paper, 
we depart from the context of Hilbert spaces and focus on a class of problems in which a matrix 
structure plays an important role. For such problems, which have recently appeared in several 
machine learning applications, we show a modified version of the representer theorem that holds 
for a class of regularizers significantly larger than in the former context. As we shall see, these 
matrix regularizers are important in the context of multi-task learning: the matrix columns are the 
parameters of different regression tasks and the regularizer encourages certain dependences across 
the tasks. 



In general, we consider problems in the framework of Tikhonov regularization [Tikhonov and Arsenin,
1977]. This regularization approach receives a set of input/output data
$(x_1, y_1), \dots, (x_m, y_m) \in \mathcal{H} \times \mathcal{Y}$ and selects a vector in $\mathcal{H}$
as the solution of an optimization problem. Here, $\mathcal{H}$ is a prescribed
Hilbert space equipped with the inner product $\langle \cdot, \cdot \rangle$ and
$\mathcal{Y} \subseteq \mathbb{R}$ a set of possible output values. The
optimization problems encountered in regularization are of the type

$$\min \left\{ \mathcal{E}\big( (\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle), (y_1, \dots, y_m) \big) + \gamma\, \Omega(w) : w \in \mathcal{H} \right\} \tag{1.1}$$

where $\gamma > 0$ is a regularization parameter. The function
$\mathcal{E} : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$ is called an error
function and $\Omega : \mathcal{H} \to \mathbb{R}$ is called a regularizer. The error function
measures the error on the data. Typically, it decomposes as a sum of univariate functions. For
example, in regression, a common choice would be the sum of square errors,
$\sum_{i=1}^{m} (\langle w, x_i \rangle - y_i)^2$. The function $\Omega$, called
the regularizer, favors certain regularity properties of the vector $w$ (such as a small norm) and can
be chosen based on available prior information about the target vector. In some Hilbert spaces such
as Sobolev spaces the regularizer is a measure of smoothness: the smaller the norm the smoother the
function.

This framework includes several well-studied learning algorithms, such as ridge regression
[Hoerl and Kennard, 1970], support vector machines [Boser et al., 1992], and many more - see
[Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004] and references therein.

An important aspect of the practical success of this approach is the observation that, for certain
choices of the regularizer, solving (1.1) reduces to identifying $m$ parameters and not
$\dim(\mathcal{H})$. Specifically, when the regularizer is the square of the Hilbert space norm, the
representer theorem holds: there exists a solution $\hat{w}$ of (1.1) which is a linear combination
of the input vectors,

$$\hat{w} = \sum_{i=1}^{m} c_i x_i \tag{1.2}$$

where $c_i$ are some real coefficients. This result is simple to prove and dates at least from the 1970's,
see, for example, [Kimeldorf and Wahba, 1970]. It is also known that it extends to any regularizer
that is a nondecreasing function of the norm [Schölkopf et al., 2001]. Several other variants
and results about the representation form (1.2) have also appeared in recent years [De Vito et al.,
2004, Dinuzzo et al., 2007, Evgeniou et al., 2000, Girosi et al., 1995, Micchelli and Pontil, 2005b,
Steinwart, 2003]. Moreover, the representer theorem has been important in machine learning,
particularly within the context of learning in reproducing kernel Hilbert spaces [Aronszajn, 1950] -
see [Schölkopf and Smola, 2002, Shawe-Taylor and Cristianini, 2004] and references therein.
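As a concrete illustration of the representation form (1.2), the following minimal Python sketch treats ridge regression (square loss with the squared norm as regularizer) on a toy data set. The data, the value of $\gamma$ and the helper names `dot` and `solve_2x2` are assumptions made only for this illustration. The sketch solves an $m \times m$ linear system for the coefficients $c_i$ and then checks that the resulting vector in the span of the inputs is a stationary point of the (convex) objective.

```python
# Sketch: ridge regression in R^3 with m = 2 examples (illustrative data only).
# Assumption: square loss and Omega(w) = <w, w>, for which (1.2) is known to hold.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve_2x2(a, b):
    """Solve the 2x2 linear system a c = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    c1 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    c2 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [c1, c2]

xs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]  # inputs x_1, x_2
ys = [1.0, -1.0]                          # outputs y_1, y_2
gamma = 0.5                               # regularization parameter

# Solve (K + gamma I) c = y, where K_ij = <x_i, x_j> is the Gram matrix.
K = [[dot(xi, xj) for xj in xs] for xi in xs]
A = [[K[i][j] + (gamma if i == j else 0.0) for j in range(2)] for i in range(2)]
c = solve_2x2(A, ys)

# Candidate solution in the span of the inputs: w = c_1 x_1 + c_2 x_2.
w = [sum(c[i] * xs[i][k] for i in range(2)) for k in range(3)]

# Gradient of sum_i (<w, x_i> - y_i)^2 + gamma <w, w> at w; it should vanish.
grad = [
    2 * sum((dot(w, xs[i]) - ys[i]) * xs[i][k] for i in range(2)) + 2 * gamma * w[k]
    for k in range(3)
]
print(w, grad)
```

Only $m = 2$ coefficients are computed here even though $w$ has three components, which is the computational point made above.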



Our first objective in this paper is to derive necessary and sufficient conditions for representer
theorems to hold. Even though one is mainly interested in regularization problems, it is more
convenient to study interpolation problems, that is, problems of the form

$$\min \left\{ \Omega(w) : w \in \mathcal{H},\ \langle w, x_i \rangle = y_i,\ \forall i = 1, \dots, m \right\}. \tag{1.3}$$

Thus, we begin this paper (Section 2) by showing how representer theorems for interpolation and
regularization relate. On one side, a representer theorem for interpolation easily implies such a
theorem for regularization with the same regularizer and any error function. Therefore, all
representer theorems obtained in this paper apply equally to interpolation and regularization. On the
other side, though, the converse implication is true under certain weak qualifications on the error
function.

Having addressed this issue, we concentrate in Section 3 on proving that an interpolation
problem (1.3) admits solutions representable in the form (1.2) if and only if the regularizer is a
nondecreasing function of the Hilbert space norm. That is, we provide a complete characterization of
regularizers that give rise to representer theorems, which had been an open question. Furthermore,
we discuss how our proof is motivated by a geometric understanding of the representer theorem,
which is equivalently expressed as a monotonicity property of the regularizer.

Our second objective is to formulate and study the novel question of representer theorems for
matrix problems. To make our discussion concrete, let us consider the problem of learning $n$ linear
regression vectors, represented by the parameters $w_1, \dots, w_n \in \mathbb{R}^d$, respectively.
Each vector can be thought of as a "task" and the goal is to jointly learn these $n$ tasks. In such
problems, there is usually prior knowledge that relates these tasks and it is often the case that
learning can improve if this knowledge is appropriately taken into account. Consequently, a good
regularizer should favor such task relations and involve all tasks jointly.

In the case of interpolation, this learning framework can be formulated concisely as

$$\min \left\{ \Omega(W) : W \in \mathbf{M}_{d,n},\ w_t^\top x_{ti} = y_{ti},\ \forall i = 1, \dots, m_t,\ t = 1, \dots, n \right\} \tag{1.4}$$

where $\mathbf{M}_{d,n}$ denotes the set of $d \times n$ real matrices and the column vectors
$w_1, \dots, w_n$ form the matrix $W$. Each task $t$ has its own input data
$x_{t1}, \dots, x_{tm_t} \in \mathbb{R}^d$ and corresponding output
values $y_{t1}, \dots, y_{tm_t} \in \mathcal{Y}$.

An important feature of such problems that distinguishes them from the type (1.3) is the
appearance of matrix products in the constraints, unlike the inner products in (1.3). In fact, as we
will discuss in Section 4.1, problems of the type (1.4) can be written in the form (1.3).
Consequently, the representer theorem applies if the matrix regularizer is a nondecreasing function of
the Frobenius norm (footnote 1). However, the optimal vector $\hat{w}_t$ for each task can be
represented as a linear combination of only those input vectors corresponding to this particular task.
Moreover, with such regularizers it is easy to see that each task in (1.4) can be optimized
independently. Hence, these regularizers are of no practical interest if the tasks are expected to be
related.

This observation leads us to formulate a modified representer theorem, which is appropriate for
matrix problems, namely,

$$\hat{w}_t = \sum_{s=1}^{n} \sum_{i=1}^{m_s} c^{(t)}_{si}\, x_{si} \qquad \forall t = 1, \dots, n, \tag{1.5}$$

where $c^{(t)}_{si}$ are scalar coefficients, for $t, s = 1, \dots, n$, $i = 1, \dots, m_s$. In other
words, we now allow for all input vectors to be present in the linear combination representing each
column of the optimal matrix. As a result, this definition greatly expands the class of regularizers
that give rise to representer theorems.

Moreover, this framework can be applied to many applications where matrix optimization
problems are involved. Our immediate motivation, however, has been more specific than that,
namely multi-task learning. Learning multiple tasks jointly has been a growing area of interest
in machine learning, especially during the past few years [Abernethy et al., 2006, Argyriou et al.,
2006, 2007a,b, Candès and Recht, 2008, Cavallanti et al., 2008, Izenman, 1975, Maurer, 2006a,b,
Srebro et al., 2005, Wolf et al., 2007, Xiang and Bennett, 2005, Yuan et al., 2007]. For instance,
some of these works use regularizers which involve the trace norm (footnote 2) of matrix $W$. The
general idea behind this methodology is that a small trace norm favors low-rank matrices. This means
that the tasks (the columns of $W$) are related in that they all lie in a low-dimensional subspace of
$\mathbb{R}^d$. In the case of the trace norm, the representer theorem (1.5) is known to hold - see
[Abernethy et al., 2006, Argyriou et al., 2007a, Amit et al., 2007], also discussed in Section 4.1.

It is natural, therefore, to ask a question similar to that in the standard Hilbert space (or
single-task) setting. That is, under which conditions on the regularizer does a representer theorem
hold? In Section 4 we provide an answer by proving a necessary and sufficient condition for representer
theorems to hold, expressed as a simple monotonicity property. This property is analogous to the
one in the Hilbert space setting, but its geometric interpretation is now algebraic in nature. We also
give a functional description equivalent to this property, that is, we show that the regularizers of
interest are the matrix nondecreasing functions of the quantity $W^\top W$.



1. Defined as $\|W\|_2 = \sqrt{\mathrm{tr}(W^\top W)}$.

2. Equal to the sum of the singular values of $W$.

Our results cover matrix problems of the type (1.4) which have already been studied in the
literature. But they also point towards some new learning methods that may perform well in practice
and can now be made computationally efficient. Thus, we close the paper with a discussion of
possible regularizers that satisfy our conditions and have been used or can be used in the future in
machine learning problems.

1.1 Notation 

Before proceeding, we introduce the notation used in this paper. We use $\mathbb{N}_d$ as a
shorthand for the set of integers $\{1, \dots, d\}$. We use $\mathbb{R}^d$ to denote the linear space
of vectors with $d$ real components. The standard inner product in this space is denoted by
$\langle \cdot, \cdot \rangle$, that is, $\langle w, v \rangle = \sum_{i \in \mathbb{N}_d} w_i v_i$,
$\forall w, v \in \mathbb{R}^d$, where $w_i, v_i$ are the $i$-th components of $w, v$ respectively.
More generally, we will consider Hilbert spaces which we will denote by $\mathcal{H}$, equipped with
an inner product $\langle \cdot, \cdot \rangle$.

We also let $\mathbf{M}_{d,n}$ be the linear space of $d \times n$ real matrices. If
$W, Z \in \mathbf{M}_{d,n}$ we define their Frobenius inner product as
$\langle W, Z \rangle = \mathrm{tr}(W^\top Z)$, where $\mathrm{tr}$ denotes the trace of a matrix. With
$\mathbf{S}^d$ we denote the set of $d \times d$ real symmetric matrices and with $\mathbf{S}^d_+$
($\mathbf{S}^d_{++}$) its subset of positive semidefinite (definite) ones. We use $\succ$ and
$\succeq$ for the positive definite and positive semidefinite partial orderings, respectively.
Finally, we let $\mathbf{O}^d$ be the set of $d \times d$ orthogonal matrices.



2 Regularization versus Interpolation 

The line of attack we shall follow in this paper will go through interpolation. That is, our main 
concern will be to obtain necessary and sufficient conditions for representer theorems that hold for 
interpolation problems. However, in practical applications one encounters regularization problems 
more frequently than interpolation problems. 

First of all, the family of the former problems is more general than that of the latter ones.
Indeed, an interpolation problem can be simply obtained in the limit as the regularization parameter
goes to zero [Micchelli and Pinkus, 1994]. More importantly, regularization enables one to trade
off interpolation of the data against smoothness or simplicity of the model, whereas interpolation
frequently suffers from overfitting.
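The limiting relation between regularization and interpolation can be seen in a small Python sketch. It assumes a single example $(x, y)$, the square loss and the squared norm as regularizer; these choices, the data and the helper name `ridge_solution` are illustrative assumptions. Under them the regularized solution has the closed form $w_\gamma = y\,x / (\|x\|^2 + \gamma)$, and its interpolation error shrinks as $\gamma$ decreases to zero.

```python
# Sketch: a regularized solution approaching the interpolant as gamma -> 0+.
# Assumptions: one example (x, y), square loss, Omega(w) = <w, w>.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x = [1.0, 2.0]
y = 3.0

def ridge_solution(gamma):
    """Minimizer of (<w, x> - y)^2 + gamma <w, w>, which lies in span{x}."""
    c = y / (dot(x, x) + gamma)
    return [c * xi for xi in x]

gammas = [1.0, 0.1, 0.01, 0.001, 1e-6]
residuals = [abs(dot(ridge_solution(g), x) - y) for g in gammas]

# Minimum-norm interpolant, i.e. the limit of w_gamma as gamma -> 0+.
w_interp = [y / dot(x, x) * xi for xi in x]
print(residuals, w_interp)
```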

Thus, frequently one considers problems of the form

$$\min \left\{ \mathcal{E}\big( (\langle w, x_1 \rangle, \dots, \langle w, x_m \rangle), (y_1, \dots, y_m) \big) + \gamma\, \Omega(w) : w \in \mathcal{H} \right\}, \tag{2.1}$$

where $\gamma > 0$ is called the regularization parameter. This parameter is not known in advance but
can be tuned with techniques like cross validation [Wahba, 1990]. Here,
$\Omega : \mathcal{H} \to \mathbb{R}$ is a regularizer,
$\mathcal{E} : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$ is an error function and
$x_i \in \mathcal{H}, y_i \in \mathcal{Y}, \forall i \in \mathbb{N}_m$, are given input and output
data. The set $\mathcal{Y}$ is a subset of $\mathbb{R}$ and varies depending on the context, so that it
is typically assumed equal to $\mathbb{R}$ in the case of regression or equal to $\{-1, 1\}$ in binary
classification problems. One may also consider the associated interpolation problem, which is

$$\min \left\{ \Omega(w) : w \in \mathcal{H},\ \langle w, x_i \rangle = y_i,\ \forall i \in \mathbb{N}_m \right\}. \tag{2.2}$$

Under certain assumptions, the minima in problems (2.1) and (2.2) are attained (whenever the
constraints in (2.2) are satisfiable). Such assumptions could involve, for example, lower
semi-continuity and boundedness of sublevel sets for $\Omega$ and boundedness from below for
$\mathcal{E}$. These issues will not concern us here, as we shall assume the following about the
error function $\mathcal{E}$ and the regularizer $\Omega$, from now on.

Assumption 2.1. The minimum (2.1) is attained for any $\gamma > 0$, any input and output data
$\{x_i, y_i : i \in \mathbb{N}_m\}$ and any $m \in \mathbb{N}$. The minimum (2.2) is attained for any
input and output data $\{x_i, y_i : i \in \mathbb{N}_m\}$ and any $m \in \mathbb{N}$, whenever the
constraints in (2.2) are satisfiable.

The main objective of this paper is to obtain necessary and sufficient conditions on $\Omega$ so that
the solution of problem (2.1) satisfies a linear representer theorem.



Definition 2.1. We say that a class of optimization problems such as (2.1) or (2.2) satisfies the
linear representer theorem if, for any choice of data $\{x_i, y_i : i \in \mathbb{N}_m\}$ such that the
problem has a solution, there exists a solution that belongs to
$\mathrm{span}\{x_i : i \in \mathbb{N}_m\}$.

In this section, we show that the existence of representer theorems for regularization problems
is equivalent to the existence of representer theorems for interpolation problems, under a quite
general condition that has a rather simple geometric interpretation.

We first recall a lemma from [Micchelli and Pontil, 2004, Sec. 2] which states that (linear or
not) representer theorems for interpolation lead to representer theorems for regularization, under
no conditions on the error function.

Lemma 2.1. Let $\mathcal{E} : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$,
$\Omega : \mathcal{H} \to \mathbb{R}$ satisfying Assumption 2.1. Then if the
class of interpolation problems (2.2) satisfies the linear representer theorem, so does the class of
regularization problems (2.1).

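Proof. Consider a problem of the form (2.1) and let $\hat{w}$ be a solution. We construct an associated
interpolation problem

$$\min \left\{ \Omega(w) : w \in \mathcal{H},\ \langle w, x_1 \rangle = \langle \hat{w}, x_1 \rangle, \dots, \langle w, x_m \rangle = \langle \hat{w}, x_m \rangle \right\}. \tag{2.3}$$

By hypothesis, there exists a solution $\tilde{w}$ of (2.3) that lies in
$\mathrm{span}\{x_i : i \in \mathbb{N}_m\}$. But then $\Omega(\tilde{w}) \leq \Omega(\hat{w})$, because
$\hat{w}$ is feasible for (2.3), while the error term takes the same value at $\tilde{w}$ and
$\hat{w}$. Hence $\tilde{w}$ is a solution of (2.1) and the result follows. ■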

This lemma requires no special properties of the functions involved. Its converse, in contrast,
requires assumptions about the analytical properties of the error function. We provide one such
natural condition in the theorem below, but other conditions could conceivably work too. The main
idea in the proof is, based on a single input, to construct a sequence of appropriate regularization
problems for different values of the regularization parameter $\gamma$. Then, it suffices to show that
letting $\gamma \to 0^+$ yields a limit of the minimizers that satisfies an interpolation constraint.

Theorem 2.1. Let $\mathcal{E} : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$ and
$\Omega : \mathcal{H} \to \mathbb{R}$. Assume that $\mathcal{E}, \Omega$ are lower semi-continuous,
that $\Omega$ has bounded sublevel sets and that $\mathcal{E}$ is bounded from below. Assume also that,
for some $v \in \mathbb{R}^m \setminus \{0\}$, $y \in \mathcal{Y}^m$, there exists a unique minimizer
of $\min\{\mathcal{E}(a v, y) : a \in \mathbb{R}\}$ and that this minimizer does not equal zero. Then
if the class of regularization problems (2.1) satisfies the linear representer theorem, so does the
class of interpolation problems (2.2).

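Proof. Fix an arbitrary $x \neq 0$ and let $a_0$ be the minimizer of
$\min\{\mathcal{E}(a v, y) : a \in \mathbb{R}\}$. Consider the problems

$$\min \left\{ \mathcal{E}\left( \frac{a_0}{\|x\|^2} \langle w, x \rangle\, v,\ y \right) + \gamma\, \Omega(w) : w \in \mathcal{H} \right\}$$

for every $\gamma > 0$, and let $w_\gamma$ be a solution in the span of $x$ (known to exist by
hypothesis). We then obtain that

$$\mathcal{E}(a_0 v, y) + \gamma\, \Omega(w_\gamma) \leq \mathcal{E}\left( \frac{a_0}{\|x\|^2} \langle w_\gamma, x \rangle\, v,\ y \right) + \gamma\, \Omega(w_\gamma) \leq \mathcal{E}(a_0 v, y) + \gamma\, \Omega(x). \tag{2.4}$$

Thus, $\Omega(w_\gamma) \leq \Omega(x)$ and so, by the hypothesis on $\Omega$, the set
$\{w_\gamma : \gamma > 0\}$ is bounded. Therefore, there exists a convergent subsequence
$\{w_{\gamma_\ell} : \ell \in \mathbb{N}\}$, with $\gamma_\ell \to 0^+$, whose limit we call $\bar{w}$.
By taking the limits as $\ell \to \infty$ on the inequality on the right in (2.4), we obtain

$$\mathcal{E}\left( \frac{a_0}{\|x\|^2} \langle \bar{w}, x \rangle\, v,\ y \right) \leq \mathcal{E}(a_0 v, y)$$

and consequently, since $a_0$ is the unique minimizer,

$$\frac{a_0}{\|x\|^2} \langle \bar{w}, x \rangle = a_0$$

or, since $a_0 \neq 0$,

$$\langle \bar{w}, x \rangle = \|x\|^2.$$

In addition, since $w_\gamma$ belongs to the span of $x$ for every $\gamma > 0$, so does $\bar{w}$.
Thus, we obtain that $\bar{w} = x$. Moreover, from the definition of $w_\gamma$ we have that

$$\mathcal{E}\left( \frac{a_0}{\|x\|^2} \langle w_\gamma, x \rangle\, v,\ y \right) + \gamma\, \Omega(w_\gamma) \leq \mathcal{E}(a_0 v, y) + \gamma\, \Omega(w) \qquad \forall w \in \mathcal{H} \text{ such that } \langle w, x \rangle = \|x\|^2$$

and, combining with the definition of $a_0$, that

$$\Omega(w_\gamma) \leq \Omega(w) \qquad \forall w \in \mathcal{H} \text{ such that } \langle w, x \rangle = \|x\|^2.$$

Taking the limits as $\ell \to \infty$, we conclude that $\bar{w} = x$ is a solution of the problem

$$\min \left\{ \Omega(w) : w \in \mathcal{H},\ \langle w, x \rangle = \|x\|^2 \right\}.$$

Moreover, this assertion holds even when $x = 0$, since the hypothesis implies that $0$ is a global
minimizer of $\Omega$. Indeed, any regularization problem of the type (2.1) with zero inputs,
$x_i = 0, \forall i \in \mathbb{N}_m$, admits a solution in their span. Thus, we have shown that
$\Omega$ satisfies property (3.3) and the result follows immediately from Lemma 3.1. ■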

We now comment on some commonly used error functions. The first is the square loss,

$$\mathcal{E}(z, y) = \sum_{i \in \mathbb{N}_m} (z_i - y_i)^2,$$

for $z, y \in \mathbb{R}^m$. It is immediately apparent that Theorem 2.1 applies in this case.
The second case is the hinge loss,

$$\mathcal{E}(z, y) = \sum_{i \in \mathbb{N}_m} \max(1 - z_i y_i, 0),$$

where the outputs $y_i$ are assumed to belong to $\{-1, 1\}$ for the purpose of classification. In this
case, we may select $y_i = 1, \forall i \in \mathbb{N}_m$, and $v = (1, -2, 0, \dots, 0)^\top$ for
$m \geq 2$. Then the function $\mathcal{E}(\cdot\, v, y)$ is the one shown in Figure 1.

Figure 1: Hinge loss along the direction $(1, -2, 0, \dots, 0)$.
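The condition of Theorem 2.1 for the hinge loss can be checked numerically. The sketch below assumes $m = 4$ and a uniform grid on $[-3, 3]$, both chosen only for illustration, and evaluates $a \mapsto \mathcal{E}(a v, y)$ with $y_i = 1$ and $v = (1, -2, 0, \dots, 0)^\top$. Along this direction the function equals $\max(1 - a, 0) + \max(1 + 2a, 0) + (m - 2)$, so its unique minimizer is $a = -1/2$, which is nonzero as the theorem requires.

```python
# Sketch: hinge loss along the direction v = (1, -2, 0, ..., 0) with y_i = 1.
# Assumptions: m = 4 and a grid search, used only to illustrate the claim.

def hinge(z, y):
    return sum(max(1 - zi * yi, 0.0) for zi, yi in zip(z, y))

m = 4
y = [1.0] * m
v = [1.0, -2.0] + [0.0] * (m - 2)

grid = [k / 100 for k in range(-300, 301)]            # a in [-3, 3], step 0.01
values = [hinge([a * vi for vi in v], y) for a in grid]

best = min(values)
minimizers = [a for a, val in zip(grid, values) if abs(val - best) < 1e-12]
print(minimizers, best)
```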
Finally, the logistic loss,

$$\mathcal{E}(z, y) = \sum_{i \in \mathbb{N}_m} \log\left(1 + e^{-z_i y_i}\right),$$

is also used in classification problems. In this case, we may select
$y_i = 1, \forall i \in \mathbb{N}_m$, and $v = (2, -1)^\top$ for $m = 2$ or
$v = (m - 2, -1, \dots, -1)^\top$ for $m > 2$. In the latter case, for example,
setting to zero the derivative of $\mathcal{E}(\cdot\, v, y)$ yields the equation
$(m - 1)e^{a(m-1)} + e^{a} - m + 2 = 0$,
which can easily be seen to have a unique solution.
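The stationarity equation for the logistic loss can be solved numerically as a sanity check. The sketch below assumes $m = 3$, so that $v = (1, -1, -1)^\top$; the bracketing interval and the use of bisection are illustrative choices. With $u = e^{a}$ the equation becomes $2u^2 + u - 1 = 0$, whose positive root is $u = 1/2$, so the minimizer is $a = -\log 2 \neq 0$. Since the left-hand side is increasing in $a$, the root is unique.

```python
import math

# Sketch: unique nonzero minimizer of the logistic loss along v for m = 3.
# Assumptions: y_i = 1, v = (m - 2, -1, ..., -1), bisection on [-5, 0].

def logistic(z, y):
    return sum(math.log(1 + math.exp(-zi * yi)) for zi, yi in zip(z, y))

m = 3
y = [1.0] * m
v = [float(m - 2)] + [-1.0] * (m - 1)

def stationarity(a):
    """Left-hand side of (m - 1) e^{a(m-1)} + e^a - m + 2 = 0."""
    return (m - 1) * math.exp(a * (m - 1)) + math.exp(a) - m + 2

lo, hi = -5.0, 0.0            # stationarity(lo) < 0 < stationarity(hi)
for _ in range(100):
    mid = (lo + hi) / 2
    if stationarity(mid) > 0:
        hi = mid
    else:
        lo = mid
a0 = (lo + hi) / 2

def loss_along_v(a):
    return logistic([a * vi for vi in v], y)

print(a0, loss_along_v(a0))
```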
Summarizing, we obtain the following corollary. 

Corollary 2.1. If $\mathcal{E} : \mathbb{R}^m \times \mathcal{Y}^m \to \mathbb{R}$ is the square loss,
the hinge loss (for $m \geq 2$) or the logistic loss (for $m \geq 2$) and
$\Omega : \mathcal{H} \to \mathbb{R}$ is lower semi-continuous with bounded sublevel sets, then the
class of problems (2.1) satisfies the linear representer theorem if and only if the class of problems
(2.2) does.

Note also that the condition on $\mathcal{E}$ in Theorem 2.1 is rather weak in that an error function
$\mathcal{E}$ may satisfy it without being convex. At the same time, an error function that is
"too flat", such as a constant loss, will not do.

We conclude with a remark about the situation in which the inputs $x_i$ are linearly independent
(footnote 3). It has a brief and straightforward proof, which we do not present here.

Remark 2.1. Let $\mathcal{E}$ be the hinge loss or the logistic loss and
$\Omega : \mathcal{H} \to \mathbb{R}$ be of the form $\Omega(w) = h(\|w\|)$,
where $h : \mathbb{R}_+ \to \mathbb{R}$ is a lower semi-continuous function with bounded sublevel sets.
Then the class of regularization problems (2.1) in which the inputs $x_i, i \in \mathbb{N}_m$, are
linearly independent, satisfies the linear representer theorem.

3. This occurs frequently in practice, especially when the dimensionality $d$ is high.



3 Representer Theorems for Interpolation Problems 



The results of the previous section allow us to focus on linear representer theorems for interpolation
problems of the type (2.2). We are going to consider the case of a Hilbert space $\mathcal{H}$ as the
domain of an interpolation problem. Interpolation constraints will be formed as inner products of the
variable with the input data. For all purposes in this context, it makes no difference to think of
$\mathcal{H}$ as being equal to $\mathbb{R}^d$.

In this section, we consider the interpolation problem

$$\min \left\{ \Omega(w) : w \in \mathcal{H},\ \langle w, x_i \rangle = y_i,\ i \in \mathbb{N}_m \right\}. \tag{3.1}$$

We coin the term admissible to denote the class of regularizers we are interested in.

Definition 3.1. We say that the function $\Omega : \mathcal{H} \to \mathbb{R}$ is admissible if, for
every $m \in \mathbb{N}$ and any data set
$\{(x_i, y_i) : i \in \mathbb{N}_m\} \subseteq \mathcal{H} \times \mathcal{Y}$ such that the
interpolation constraints are satisfiable, problem (3.1) admits a solution $\hat{w}$ of the form

$$\hat{w} = \sum_{i \in \mathbb{N}_m} c_i x_i,$$

where $c_i$ are some real parameters.

We say that $\Omega : \mathcal{H} \to \mathbb{R}$ is differentiable if, for every $w \in \mathcal{H}$,
there is a unique vector denoted by $\nabla \Omega(w)$, such that for all $p \in \mathcal{H}$,

$$\lim_{t \to 0} \frac{\Omega(w + t p) - \Omega(w)}{t} = \langle \nabla \Omega(w), p \rangle.$$

This notion corresponds to the usual notion of directional derivative on $\mathbb{R}^d$ and in that
case $\nabla \Omega(w)$ is the gradient of $\Omega$ at $w$.

In the remainder of the section, we always assume that Assumption 2.1 holds for $\Omega$. The
following theorem provides a necessary and sufficient condition for a regularizer to be admissible.

Theorem 3.1. Let $\Omega : \mathcal{H} \to \mathbb{R}$ be a differentiable function and
$\dim(\mathcal{H}) \geq 2$. Then $\Omega$ is admissible if and only if

$$\Omega(w) = h(\langle w, w \rangle) \qquad \forall w \in \mathcal{H}, \tag{3.2}$$

for some nondecreasing function $h : \mathbb{R}_+ \to \mathbb{R}$.

It is well known that the above functional form is sufficient for a representer theorem to hold
(see for example [Schölkopf et al., 2001]). Here we show that it is also necessary.

The route we follow to proving the above theorem is based on a geometric interpretation of
representer theorems. This intuition can be formally expressed as condition (3.3) in the lemma
below. Both condition (3.3) and functional form (3.2) express the property that the contours of
$\Omega$ are spheres (or regions between spheres), which is apparent from Figure 2.

Lemma 3.1. A function $\Omega : \mathcal{H} \to \mathbb{R}$ is admissible if and only if it satisfies
the property that

$$\Omega(w + p) \geq \Omega(w) \qquad \forall w, p \in \mathcal{H} \text{ such that } \langle w, p \rangle = 0. \tag{3.3}$$

Figure 2: Geometric interpretation of Theorem 3.1. The function should not decrease when
moving to orthogonal directions. The contours of such a function should be spherical.



Proof. Suppose that $\Omega$ satisfies property (3.3), consider arbitrary data
$x_i, y_i, i \in \mathbb{N}_m$, and let $\hat{w}$ be a solution to problem (3.1). We can uniquely
decompose $\hat{w}$ as $\hat{w} = \bar{w} + p$ where
$\bar{w} \in \mathcal{L} := \mathrm{span}\{x_i : i \in \mathbb{N}_m\}$ and $p \in \mathcal{L}^\perp$.
From (3.3) we obtain that $\Omega(\hat{w}) \geq \Omega(\bar{w})$. Also $\bar{w}$ satisfies the
interpolation constraints and hence we conclude that $\bar{w}$ is a solution to problem (3.1).

Conversely, if $\Omega$ is admissible choose any $w \in \mathcal{H}$ and consider the problem
$\min\{\Omega(z) : z \in \mathcal{H}, \langle z, w \rangle = \langle w, w \rangle\}$. By hypothesis,
there exists a solution belonging in $\mathrm{span}\{w\}$ and hence $w$ is a solution to this problem,
since $w$ is the only feasible point in $\mathrm{span}\{w\}$. Thus, we have that
$\Omega(w + p) \geq \Omega(w)$ for every $p$ such that $\langle w, p \rangle = 0$. ■
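The decomposition $\hat{w} = \bar{w} + p$ used in the first half of the proof can be sketched in a few lines of Python. The data and the helper `project_onto_span` (a Gram-Schmidt projection) are assumptions made for this illustration only. The sketch checks that the projection $\bar{w}$ onto the span of the inputs keeps every inner product $\langle w, x_i \rangle$ unchanged, that $p$ is orthogonal to $\bar{w}$, and that the norm does not increase, which is what (3.3) exploits when $\Omega$ is a nondecreasing function of the norm.

```python
# Sketch: orthogonal decomposition w = w_bar + p with w_bar in span{x_i}.
# The vectors below are illustrative; any w and inputs x_i would do.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_onto_span(w, xs):
    """Orthogonal projection of w onto span(xs), via Gram-Schmidt."""
    basis = []
    for x in xs:
        r = list(x)
        for q in basis:
            coef = dot(r, q)
            r = [ri - coef * qi for ri, qi in zip(r, q)]
        norm = dot(r, r) ** 0.5
        if norm > 1e-12:                      # skip linearly dependent inputs
            basis.append([ri / norm for ri in r])
    proj = [0.0] * len(w)
    for q in basis:
        coef = dot(w, q)
        proj = [pi + coef * qi for pi, qi in zip(proj, q)]
    return proj

xs = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]       # inputs x_1, x_2
w = [3.0, -1.0, 0.5]                           # a vector satisfying some constraints
w_bar = project_onto_span(w, xs)
p = [wi - bi for wi, bi in zip(w, w_bar)]

same_constraints = all(abs(dot(w, x) - dot(w_bar, x)) < 1e-9 for x in xs)
orthogonal = abs(dot(w_bar, p)) < 1e-9
norm_not_larger = dot(w_bar, w_bar) <= dot(w, w) + 1e-9
print(same_constraints, orthogonal, norm_not_larger)
```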

It remains to establish the equivalence of the geometric property (3.3) to condition (3.2) that
$\Omega$ is a nondecreasing function of the $L_2$ norm.

Proof of Theorem 3.1. Assume first that (3.3) holds and $\dim(\mathcal{H}) < \infty$. In this case, we
only need to consider the case that $\mathcal{H} = \mathbb{R}^d$ since (3.3) can always be rewritten as
an equivalent condition on $\mathbb{R}^d$, using an orthonormal basis of $\mathcal{H}$.

First we observe that, since $\Omega$ is differentiable, this property implies the condition that

$$\langle \nabla \Omega(w), p \rangle = 0, \tag{3.4}$$

for all $w, p \in \mathbb{R}^d$ such that $\langle w, p \rangle = 0$.

Now, fix any $w_0 \in \mathbb{R}^d$ such that $\|w_0\| = 1$. Consider an arbitrary
$w \in \mathbb{R}^d$. Then there exists an orthogonal matrix $U \in \mathbf{O}^d$ such that
$w = \|w\| U w_0$ and $\det(U) = 1$ (see Lemma 5.1 in the appendix). Moreover, we can write
$U = e^{D}$ for some skew-symmetric matrix $D \in \mathbf{M}_{d,d}$ - see
[Horn and Johnson, 1991, Example 6.2.15]. Consider now the path $z : [0, 1] \to \mathbb{R}^d$ with

$$z(\lambda) = \|w\|\, e^{\lambda D} w_0 \qquad \forall \lambda \in [0, 1].$$

We have that $z(0) = \|w\| w_0$ and $z(1) = w$. Moreover, since
$\langle z(\lambda), z(\lambda) \rangle = \langle w, w \rangle$, we obtain that

$$\langle z'(\lambda), z(\lambda) \rangle = 0 \qquad \forall \lambda \in [0, 1].$$

Applying (3.4) with $w = z(\lambda), p = z'(\lambda)$, it follows that

$$\frac{d\, \Omega(z(\lambda))}{d\lambda} = \langle \nabla \Omega(z(\lambda)), z'(\lambda) \rangle = 0.$$

Consequently, $\Omega(z(\lambda))$ is constant and hence $\Omega(\|w\| w_0) = \Omega(w)$. Setting
$h(\xi) = \Omega(\sqrt{\xi}\, w_0), \forall \xi \in \mathbb{R}_+$, yields (3.2). In addition, $h$ must
be nondecreasing in order for $\Omega$ to satisfy property (3.3).

For the case $\dim(\mathcal{H}) = \infty$ we can argue similarly using instead the path

$$z(\lambda) = \frac{(1 - \lambda) w_0 + \lambda w}{\|(1 - \lambda) w_0 + \lambda w\|}\, \|w\|$$

which is differentiable on $[0, 1]$ when $w \notin \mathrm{span}\{w_0\}$. We confirm equation (3.2)
for vectors in $\mathrm{span}\{w_0\}$ by a limiting argument on vectors not in
$\mathrm{span}\{w_0\}$ since $\Omega$ is surely continuous.

Conversely, if $\Omega(w) = h(\langle w, w \rangle)$ and $h$ is nondecreasing, property (3.3) follows
immediately. ■
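To make the path argument tangible, the following Python sketch works in $\mathbb{R}^2$, where $e^{\lambda D}$ with $D = \theta \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ is simply the rotation by the angle $\lambda\theta$. The vectors $w_0$ and $w$ and the restriction to two dimensions are illustrative assumptions. The sketch verifies that the path starts at $\|w\| w_0$, ends at $w$, stays on the sphere of radius $\|w\|$, and has velocity orthogonal to its position, which is exactly what allows (3.4) to be applied along the path.

```python
import math

# Sketch of the path z(lambda) = ||w|| e^{lambda D} w_0 in R^2 (illustrative).
# Here e^{lambda D} is the rotation by lambda * theta.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

w0 = [1.0, 0.0]                                # unit vector
w = [-1.0, 1.0]                                # arbitrary target vector
norm_w = math.hypot(w[0], w[1])
theta = math.atan2(w[1], w[0]) - math.atan2(w0[1], w0[0])

def z(lam):
    c, s = math.cos(lam * theta), math.sin(lam * theta)
    return [norm_w * (c * w0[0] - s * w0[1]), norm_w * (s * w0[0] + c * w0[1])]

def z_prime(lam):
    # d/d(lambda) of a rotation path: theta * J z(lambda), with J = [[0, -1], [1, 0]].
    zx, zy = z(lam)
    return [-theta * zy, theta * zx]

lams = [k / 10 for k in range(11)]
norms = [math.sqrt(dot(z(l), z(l))) for l in lams]
inner = [dot(z_prime(l), z(l)) for l in lams]
print(z(0.0), z(1.0), norms[0], inner[0])
```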

We note that we could modify Definition 3.1 by requiring that any solution of problem (3.1) be
in the linear span of the input data. We call such regularizers strictly admissible. Then with minor
modifications to Lemma 3.1 (namely, requiring that equality in (3.3) holds only if $p = 0$) and to
the proof of Theorem 3.1 (namely, requiring $h$ to be strictly increasing) we have the following
corollary.

Corollary 3.1. Let $\Omega : \mathcal{H} \to \mathbb{R}$ be a differentiable function. Then $\Omega$ is
strictly admissible if and only if $\Omega(w) = h(\langle w, w \rangle), \forall w \in \mathcal{H}$,
where $h : \mathbb{R}_+ \to \mathbb{R}$ is strictly increasing.

Theorem 3.1 can be used to verify whether the linear representer theorem can be obtained when
using a regularizer $\Omega$. For example, the function
$\|w\|_p = \left( \sum_{i \in \mathbb{N}_d} |w_i|^p \right)^{1/p}$ is not admissible for any
$p > 0, p \neq 2$, because it cannot be expressed as a function of the Hilbert space norm. Indeed, if
we choose any $a \in \mathbb{R}$ and let $w = (a \delta_{i1} : i \in \mathbb{N}_d)$, the requirement
that $\|w\|_p = h(\langle w, w \rangle)$ would imply that $h(a^2) = |a|, \forall a \in \mathbb{R}$,
and hence that $\|w\|_p = \|w\|$.
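A direct numerical counterexample for $p = 1$ may help here. The Python sketch below assumes $d = 2$ and a single interpolation constraint $\langle w, x \rangle = y$ with illustrative data. The only feasible point in $\mathrm{span}\{x\}$ is $w = (y / \|x\|^2)\, x$, yet another feasible vector has a strictly smaller $\ell_1$ norm, so no minimizer lies in the span of the input.

```python
# Sketch: the l1 norm violates the linear representer theorem on a toy problem.
# Assumptions: d = 2, one constraint <w, x> = y, illustrative data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l1(w):
    return sum(abs(a) for a in w)

x = [1.0, 2.0]
y = 1.0

c = y / dot(x, x)
w_span = [c * a for a in x]       # the only feasible point in span{x}
w_alt = [0.0, y / x[1]]           # feasible, but not a multiple of x

print(l1(w_span), l1(w_alt))
```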



4 Matrix Learning Problems 

In this section, we investigate how representer theorems and results like Theorem 3.1 can be
extended in the context of optimization problems that involve matrices.



4.1 Exploiting Matrix Structure 

As we have already seen, our discussion in Section 3 applies to any Hilbert space. Thus, we may
consider the finite-dimensional Hilbert space of $d \times n$ matrices $\mathbf{M}_{d,n}$ equipped
with the Frobenius inner product $\langle \cdot, \cdot \rangle$. As in Section 3, we could consider
interpolation problems of the form

$$\min \left\{ \Omega(W) : W \in \mathbf{M}_{d,n},\ \langle W, X_i \rangle = y_i,\ \forall i \in \mathbb{N}_m \right\} \tag{4.1}$$

where $X_i \in \mathbf{M}_{d,n}$ are prescribed input matrices and $y_i \in \mathcal{Y}$ scalar
outputs, for $i \in \mathbb{N}_m$. Then Theorem 3.1 states that such a problem admits a solution of the
form

$$\hat{W} = \sum_{i \in \mathbb{N}_m} c_i X_i, \tag{4.2}$$

where $c_i$ are some real parameters, if and only if $\Omega$ can be written in the form

$$\Omega(W) = h(\langle W, W \rangle) \qquad \forall W \in \mathbf{M}_{d,n}, \tag{4.3}$$

where $h : \mathbb{R}_+ \to \mathbb{R}$ is nondecreasing.

However, optimization problems of the form (4.1) do not occur frequently in machine learning
practice. The constraints of (4.1) do not utilize the structure inherent in matrices - that is, it makes
no difference whether the variable is regarded as a matrix or as a vector - and hence have limited
applicability. In contrast, in many recent applications, some of which we shall briefly discuss
below, it is natural to consider problems like

$$\min \left\{ \Omega(W) : W \in \mathbf{M}_{d,n},\ w_t^\top x_{ti} = y_{ti},\ \forall i \in \mathbb{N}_{m_t},\ t \in \mathbb{N}_n \right\}. \tag{4.4}$$

Here, $w_t \in \mathbb{R}^d$ denote the columns of matrix $W$, for $t \in \mathbb{N}_n$, and
$x_{ti} \in \mathbb{R}^d, y_{ti} \in \mathcal{Y}$ are prescribed inputs and outputs, for
$i \in \mathbb{N}_{m_t}, t \in \mathbb{N}_n$. In addition, the desired representation form for
solutions of such matrix problems is different from (4.2). In this case, one may encounter representer
theorems of the form

$$\hat{w}_t = \sum_{s \in \mathbb{N}_n} \sum_{i \in \mathbb{N}_{m_s}} c^{(t)}_{si}\, x_{si} \qquad \forall t \in \mathbb{N}_n, \tag{4.5}$$

where $c^{(t)}_{si}$ are scalar coefficients for $s, t \in \mathbb{N}_n, i \in \mathbb{N}_{m_s}$.

To illustrate the above, consider the problem of multi-task learning and problems closely related
to it [Abernethy et al., 2006, Argyriou et al., 2006, 2007a,b, Candès and Recht, 2008, Cavallanti et al.,
2008, Izenman, 1975, Maurer, 2006a,b, Srebro et al., 2005, Yuan et al., 2007, etc.]. In learning
multiple tasks jointly, each task may be represented by a vector of regression parameters
that corresponds to the column $w_t$ in our notation. There are $n$ tasks and $m_t$ data examples
$\{(x_{ti}, y_{ti}) : i \in \mathbb{N}_{m_t}\}$ for the $t$-th task. The learning algorithm used is

$$\min \left\{ \mathcal{E}\big( w_t^\top x_{ti}, y_{ti} : i \in \mathbb{N}_{m_t}, t \in \mathbb{N}_n \big) + \gamma\, \Omega(W) : W \in \mathbf{M}_{d,n} \right\} \tag{4.6}$$

where $\mathcal{E} : \mathbb{R}^M \times \mathcal{Y}^M \to \mathbb{R}$,
$M = \sum_{t \in \mathbb{N}_n} m_t$. The error term expresses the objective that the
regression vector for each task should fit well the data for this particular task. The choice of the
regularizer $\Omega$ is important in that it captures certain relationships between the tasks. One
common choice is the trace norm, which is defined to be the sum of the singular values of a matrix or,
equivalently,

$$\Omega(W) = \|W\|_1 := \mathrm{tr}\big( (W^\top W)^{\frac{1}{2}} \big).$$

Regularization with the trace norm learns the tasks as one joint optimization problem, by favoring
matrices with low rank. In other words, the vectors $w_t$ are related in that they are all linear
combinations of a small set of basis vectors. It has been demonstrated that this approach allows
for accurate estimation of related tasks even when there are only few data points available for each
task.

Thus, it is natural to consider optimization problems of the form (4.4). In fact, these problems
can be seen as instances of problems of the form (4.1), because the quantity $w_t^\top x_{ti}$ can be
written as the inner product between $W$ and a matrix having all its columns equal to zero except for
the $t$-th column being equal to $x_{ti}$. It is also easy to see that (4.1) is a richer class since the
corresponding constraints are less restrictive.
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Despite this fact, by focusing on the class (14.41) we concentrate on problems of more practical 
interest and we can obtain representer theorems for a richer class of regularizers, which includes 
the trace norm and other useful functions. In contrast, regularization with the functional form (14.31) 
is not a satisfactory approach since it ignores matrix structure. In particular, regularization with the 
Frobenius norm (and a separable error function) corresponds to learning each task independently, 
ignoring relationships among the tasks. 
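
The reduction of a constraint of type (4.4) to one of type (4.1), described above, can be checked on a small example. This is only a sketch assuming NumPy; `embed_input` is a hypothetical helper name and the numbers are arbitrary.

```python
import numpy as np

def embed_input(x, t, n):
    """Return the d x n matrix whose t-th column is x and whose other columns are zero."""
    X = np.zeros((x.shape[0], n))
    X[:, t] = x
    return X

W = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])                    # d = 3, n = 2
x = np.array([1.0, 0.0, -1.0])               # an input vector for task t
t = 1                                         # zero-based task index

lhs = float(W[:, t] @ x)                      # w_t^T x
rhs = float(np.sum(W * embed_input(x, t, 2))) # Frobenius inner product <W, X>
print(lhs, rhs)
```

Both quantities equal $2 \cdot 1 + 4 \cdot 0 + 6 \cdot (-1) = -4$ here.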

A representer theorem of the form (4.5) for regularization with the trace norm has been shown
in [Argyriou et al., 2007a]. Related results have also appeared in [Abernethy et al., 2006, Amit et al.,
2007]. We repeat here the statement and the proof of this theorem, in order to better motivate our
proof technique of Section 4.2.

Theorem 4.1. If $\Omega$ is the trace norm then problem (4.4) (or problem (4.6)) admits a solution $\hat{W}$ of
the form (4.5), for some $c^{(t)}_{si} \in \mathbb{R}$, $i \in \mathbb{N}_{m_s}$, $s, t \in \mathbb{N}_n$.



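Proof. Let $\hat{W}$ be a solution of (4.4) and let $\mathcal{L} := \operatorname{span}\{x_{si} : s \in \mathbb{N}_n, i \in \mathbb{N}_{m_s}\}$. We can decompose
the columns of $\hat{W}$ as $\hat{w}_t = \bar{w}_t + p_t$, $\forall t \in \mathbb{N}_n$, where $\bar{w}_t \in \mathcal{L}$ and $p_t \in \mathcal{L}^\perp$. Hence $\hat{W} = \bar{W} + P$,
where $\bar{W}$ is the matrix with columns $\bar{w}_t$ and $P$ is the matrix with columns $p_t$. Moreover we have
that $P^\top \bar{W} = 0$. From Lemma 5.2 in the appendix, we obtain that $\|\hat{W}\|_1 \geq \|\bar{W}\|_1$. We also
have that $\langle \hat{w}_t, x_{ti} \rangle = \langle \bar{w}_t, x_{ti} \rangle$, for every $i \in \mathbb{N}_{m_t}$, $t \in \mathbb{N}_n$. Thus, $\bar{W}$ preserves the interpolation
constraints (or the value of the error term) while not increasing the value of the regularizer. Hence,
it is a solution of the optimization problem and the assertion follows. ■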
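
The decomposition step of this proof can be illustrated numerically. The sketch below assumes NumPy; the names `project_columns_onto_span`, `W_hat`, `W_bar` and the random data are hypothetical. Note that `W_hat` is an arbitrary matrix here rather than a computed minimizer, since the projection argument applies to any matrix.

```python
import numpy as np

def trace_norm(W):
    """Sum of the singular values of W."""
    return float(np.linalg.svd(W, compute_uv=False).sum())

def project_columns_onto_span(W, inputs):
    """Project every column of W onto the span L of all input vectors of all tasks.

    inputs: list over tasks of arrays of shape (m_t, d), one input vector per row.
    """
    X_all = np.vstack(inputs).T                        # d x M, columns are all x_{si}
    U, s, _ = np.linalg.svd(X_all, full_matrices=False)
    basis = U[:, s > 1e-12 * s.max()]                  # orthonormal basis of L
    return basis @ (basis.T @ W)

rng = np.random.default_rng(0)
d, n = 8, 3
inputs = [rng.standard_normal((2, d)) for _ in range(n)]   # m_t = 2 inputs per task
W_hat = rng.standard_normal((d, n))                        # stands in for a solution
W_bar = project_columns_onto_span(W_hat, inputs)           # columns in L
P = W_hat - W_bar                                          # columns in the complement of L

# The projected matrix reproduces every inner product w_t^T x_{ti} ...
constraints_kept = all(
    np.allclose(X_t @ W_hat[:, t], X_t @ W_bar[:, t]) for t, X_t in enumerate(inputs)
)
# ... and, by Lemma 5.2, its trace norm is not larger.
print(constraints_kept, trace_norm(W_bar) <= trace_norm(W_hat))
```

Each column of `W_bar` is by construction a linear combination of the inputs of all tasks, which is the representation (4.5).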

A simple but important observation about this and related results is that each task vector $w_t$ is
a linear combination of the data for all the tasks. This contrasts with the representation form (4.2)
obtained by using Frobenius inner product constraints. Interpreting (4.2) in a multi-task context, by
appropriately choosing the $X_i$ as described above, would imply that each $w_t$ is a linear combination
of only the data for task $t$.

Finally, in some applications the following variant, similar to the type (4.4), has appeared,

$$\min \{ \Omega(W) : W \in \mathbf{M}_{d,n},\ w_t^\top x_i = y_{ti} \ \ \forall i \in \mathbb{N}_m, t \in \mathbb{N}_n \} . \tag{4.7}$$

Problems of this type correspond to a special case in multi-task learning applications in which the
input points are the same for all the tasks. For instance, this is the case with collaborative filtering
or applications in marketing where the same products/entities are rated by all users/consumers (see,
for example, [Aaker et al., 2004, Evgeniou et al., 2005, Lenk et al., 1996, Srebro et al., 2005] for
various approaches to this problem).

4.2 Characterization of Matrix Regularizers

Our objective in this section will be to state and prove a general representer theorem for problems
of the form (4.4) or (4.7) using a functional form analogous to (3.2). The key insight used in
the proof of [Argyriou et al., 2007a] has been that the trace norm is defined in terms of a matrix
function that preserves the partial ordering of matrices. That is, it satisfies Lemma 5.2, which is a
matrix analogue of the geometric property (3.3). To prove our main result (Theorem 4.2), we shall
build on this observation in a way similar to the approach followed in Section 3.

Before proceeding to a study of matrix interpolation problems, it should be remarked that our
results will apply equally to matrix regularization problems. That is, a variant of Theorem 2.1 can
be shown for matrix regularization and interpolation problems, following along the lines of the
proof of that theorem. The hypothesis now becomes that for some $V, Y \in \mathbf{M}_{n,n}$, $V$ nonsingular,
the minimizer of $\min\{\mathcal{E}(AV, Y) : A \in \mathbf{M}_{n,n}\}$ is unique and nonsingular. As a result, matrix
regularization with the square loss, the hinge loss or the logistic loss does not differ from matrix
interpolation with respect to representer theorems.

Thus, we may focus on the interpolation problems (4.4) and (4.7). First of all, observe that,
by definition, problems of the type (4.4) include those of type (4.7). Conversely, consider a set
of constraints of the type (4.4) with one input per task ($m_t = 1$, $\forall t \in \mathbb{N}_n$) and not all input
vectors collinear. Then any matrix $W$ such that each $w_t$ lies on a fixed hyperplane perpendicular
to $x_{t1}$ satisfies these constraints. At least two of these hyperplanes do not coincide, whereas each
constraint in (4.7) implies that all vectors $w_t$ lie on the same hyperplane. Therefore, the class of
problems (4.4) is strictly larger than the class (4.7).

However, it turns out that with regard to representer theorems of the form (4.5) there is no
distinction between the two types of problems. In other words, the representer theorem holds
for the same regularizers independent of whether each task has its own sample or not. More
importantly, we can connect the existence of representer theorems to a geometric property of the
regularizer, in a way analogous to property (3.3) in Section 3. These facts are stated in the following
lemma.

Lemma 4.1. The following statements are equivalent:

(a) Problem (4.7) admits a solution of the form (4.5), for every data set $\{(x_i, y_{ti}) : i \in \mathbb{N}_m, t \in \mathbb{N}_n\} \subseteq \mathbb{R}^d \times \mathbb{R}$
and every $m \in \mathbb{N}$, such that the interpolation constraints are satisfiable.

(b) Problem (4.4) admits a solution of the form (4.5), for every data set $\{(x_{ti}, y_{ti}) : i \in \mathbb{N}_{m_t}, t \in \mathbb{N}_n\} \subseteq \mathbb{R}^d \times \mathbb{R}$
and every $m_t \in \mathbb{N}$, such that the interpolation constraints are satisfiable.

(c) The function $\Omega$ satisfies the property

$$\Omega(W + P) \geq \Omega(W) \qquad \forall\, W, P \in \mathbf{M}_{d,n} \text{ such that } W^\top P = 0 . \tag{4.8}$$

Proof. We will show that (a) $\Rightarrow$ (c), (c) $\Rightarrow$ (b) and (b) $\Rightarrow$ (a).

[(a) $\Rightarrow$ (c)] Consider any $W \in \mathbf{M}_{d,n}$. Choose $m = n$ and the input data to be the columns
of $W$. In other words, consider the problem

$$\min \{ \Omega(Z) : Z \in \mathbf{M}_{d,n},\ Z^\top W = W^\top W \} .$$

By hypothesis, there exists a solution $\hat{Z} = WC$ for some $C \in \mathbf{M}_{n,n}$. Since $(\hat{Z} - W)^\top W = 0$, all
columns of $\hat{Z} - W$ have to belong to the null space of $W^\top$. But, at the same time, they have to lie
in the range of $W$ and hence we obtain that $\hat{Z} = W$. Therefore, we obtain property (4.8) after the
variable change $P = Z - W$.

[(c) $\Rightarrow$ (b)] Consider arbitrary $x_{ti} \in \mathbb{R}^d$, $y_{ti} \in \mathbb{R}$, $i \in \mathbb{N}_{m_t}$, $t \in \mathbb{N}_n$, and let $\hat{W}$ be a solution
to problem (4.4). We can decompose the columns of $\hat{W}$ as $\hat{w}_t = \bar{w}_t + p_t$ where $\bar{w}_t \in \mathcal{L} :=
\operatorname{span}\{x_{si} : i \in \mathbb{N}_{m_s}, s \in \mathbb{N}_n\}$ and $p_t \in \mathcal{L}^\perp$, $\forall t \in \mathbb{N}_n$. By hypothesis, since $\bar{W}^\top P = 0$ for the
matrix $P$ with columns $p_t$, we have $\Omega(\hat{W}) \geq \Omega(\bar{W})$. Since $\hat{W}$
interpolates the data, so does $\bar{W}$ and therefore $\bar{W}$ is a solution to (4.4).

[(b) $\Rightarrow$ (a)] Trivial, since any problem of type (4.7) is also of type (4.4). ■

The above lemma provides us with a criterion for characterizing all regularizers satisfying
representer theorems of the form (4.5), in the context of problems (4.4) or (4.7). Our objective
will be to obtain a functional form analogous to (3.2) that describes functions satisfying property
(4.8). This property does not have a simple geometric interpretation, unlike (3.3) which describes
functions with spherical contours. The reason is that the matrix product in the constraint is more
difficult to tackle than an inner product.

Similar to the Hilbert space setting (3.2), where we required $h$ to be a nondecreasing real function,
the functional description of the regularizer now involves the notion of a matrix nondecreasing
function.

Definition 4.1. We say that the function $h : \mathbf{S}^n_+ \to \mathbb{R}$ is nondecreasing in the order of matrices if
$h(A) \leq h(B)$ for all $A, B \in \mathbf{S}^n_+$ such that $A \preceq B$.

Theorem 4.2. Let $d, n \in \mathbb{N}$ with $d \geq 2n$. The differentiable function $\Omega : \mathbf{M}_{d,n} \to \mathbb{R}$ satisfies
property (4.8) if and only if there exists a matrix nondecreasing function $h : \mathbf{S}^n_+ \to \mathbb{R}$ such that

$$\Omega(W) = h(W^\top W), \qquad \forall\, W \in \mathbf{M}_{d,n}. \tag{4.9}$$



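Proof. We first assume that $\Omega$ satisfies property (4.8). From this property it follows that, for all
$W, P \in \mathbf{M}_{d,n}$ with $W^\top P = 0$,

$$\langle \nabla \Omega(W), P \rangle = 0. \tag{4.10}$$

To see this, observe that if the matrix $W^\top P$ is zero then, for all $\varepsilon > 0$, we have that

$$\frac{\Omega(W + \varepsilon P) - \Omega(W)}{\varepsilon} \geq 0 .$$

Taking the limit as $\varepsilon \to 0^+$ we obtain that $\langle \nabla \Omega(W), P \rangle \geq 0$. Similarly, choosing $\varepsilon < 0$ we obtain
that $\langle \nabla \Omega(W), P \rangle \leq 0$ and equation (4.10) follows.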
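
For the trace norm, equation (4.10) can be checked directly at points where the norm is differentiable. The sketch below assumes NumPy and uses the standard fact (not stated in this section) that at a matrix $W$ of full column rank with thin singular value decomposition $W = U \Sigma V^\top$ the trace norm has gradient $U V^\top$. All variable names and the random data are illustrative.

```python
import numpy as np

def trace_norm(A):
    """Sum of the singular values of A."""
    return float(np.linalg.svd(A, compute_uv=False).sum())

rng = np.random.default_rng(1)
d, n = 5, 2
W = rng.standard_normal((d, n))                   # full column rank for generic data
U, s, Vt = np.linalg.svd(W, full_matrices=False)
grad = U @ Vt                                     # gradient of the trace norm at W

# Build P with W^T P = 0 by removing the component of G in the range of W.
G = rng.standard_normal((d, n))
P = G - U @ (U.T @ G)

inner = float(np.sum(grad * P))                   # <grad Omega(W), P>, expected ~ 0

eps = 1e-4
one_sided = (trace_norm(W + eps * P) - trace_norm(W)) / eps   # nonnegative by (4.8)
print(inner, one_sided)
```

The inner product vanishes up to rounding, while the one-sided difference quotient is nonnegative and small, in line with the limiting argument above.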

Now, consider any matrix $W \in \mathbf{M}_{d,n}$. Let $r = \operatorname{rank}(W)$ and let us write $W$ in a singular value
decomposition as follows

$$W = \sum_{i \in \mathbb{N}_r} \sigma_i\, u_i v_i^\top ,$$

where $\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0$ are the singular values and $u_i \in \mathbb{R}^d$, $v_i \in \mathbb{R}^n$, $i \in \mathbb{N}_r$, sets of
singular vectors, so that $u_i^\top u_j = v_i^\top v_j = \delta_{ij}$, $\forall i, j \in \mathbb{N}_r$. Also, let $u_{r+1}, \ldots, u_d \in \mathbb{R}^d$ be vectors
that together with $u_1, \ldots, u_r$ form an orthonormal basis of $\mathbb{R}^d$. Without loss of generality, let us
pick $u_1$ and consider any unit vector $z$ orthogonal to the vectors $u_2, \ldots, u_r$. Let $k = d - r + 1$ and
$q \in \mathbb{R}^k$ be the unit vector such that

$$z = Rq,$$

where $R = (u_1, u_{r+1}, \ldots, u_d)$. We can complete $q$ by adding $d - r$ columns to its right in order
to form an orthogonal matrix $Q \in \mathbf{O}^k$ and, since $d > n$, we may select these columns so that
$\det(Q) = 1$. Furthermore, we can write this matrix as $Q = e^D$ with $D \in \mathbf{M}_{k,k}$ a skew-symmetric
matrix (see [Horn and Johnson, 1991, Example 6.2.15]).

We also define the path $Z : [0, 1] \to \mathbf{M}_{d,n}$ as

$$Z(\lambda) = \sigma_1 R e^{\lambda D} e_1 v_1^\top + \sum_{i=2}^{r} \sigma_i\, u_i v_i^\top \qquad \forall \lambda \in [0, 1],$$

where $e_1$ denotes the vector $(1, 0, \ldots, 0)^\top \in \mathbb{R}^k$. In other words, we fix the singular values, the right
singular vectors and the $r - 1$ left singular vectors $u_2, \ldots, u_r$ and only allow the first left singular
vector to vary. This path has the properties that $Z(0) = W$ and $Z(1) = \sigma_1 z v_1^\top + \sum_{i=2}^{r} \sigma_i\, u_i v_i^\top$.
By construction of the path, it holds that

$$Z'(\lambda) = \sigma_1 R e^{\lambda D} D e_1 v_1^\top$$

and hence

$$Z(\lambda)^\top Z'(\lambda) = (\sigma_1 R e^{\lambda D} e_1 v_1^\top)^\top \sigma_1 R e^{\lambda D} D e_1 v_1^\top = \sigma_1^2\, v_1 e_1^\top D e_1 v_1^\top = 0,$$

for every $\lambda \in [0, 1]$, because $D_{11} = 0$ (the terms of $Z(\lambda)$ involving $u_2, \ldots, u_r$ do not contribute,
since these vectors are orthogonal to the columns of $R$). Hence, using equation (4.10), we have that

$$\langle \nabla \Omega(Z(\lambda)), Z'(\lambda) \rangle = 0$$

and, since $\frac{d\, \Omega(Z(\lambda))}{d\lambda} = \langle \nabla \Omega(Z(\lambda)), Z'(\lambda) \rangle$, we conclude that $\Omega(Z(\lambda))$ equals a constant
independent of $\lambda$. In particular, $\Omega(Z(0)) = \Omega(Z(1))$, that is,

$$\Omega(W) = \Omega\Big( \sigma_1 z v_1^\top + \sum_{i=2}^{r} \sigma_i\, u_i v_i^\top \Big) .$$

In other words, if we fix the singular values of $W$, the right singular vectors and all the left singular
vectors but one, $\Omega$ does not depend on the remaining left singular vector (because the choice of $z$
is independent of $u_1$).

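In fact, this readily implies that $\Omega$ does not depend on the left singular vectors at all. Indeed, fix
an arbitrary $Y \in \mathbf{M}_{d,n}$ such that $Y^\top Y = I$. Consider the matrix $Y (W^\top W)^{\frac{1}{2}}$, which can be written
using the same singular values and right singular vectors as $W$. That is,

$$Y (W^\top W)^{\frac{1}{2}} = \sum_{i \in \mathbb{N}_r} \sigma_i\, \tau_i v_i^\top ,$$

where $\tau_i = Y v_i$, $\forall i \in \mathbb{N}_r$. Now, we select unit vectors $z_1, \ldots, z_r \in \mathbb{R}^d$ as follows:

$$\begin{aligned}
z_1 &= u_1 \\
z_2 &\perp z_1, u_3, \ldots, u_r, \tau_1 \\
&\ \ \vdots \\
z_r &\perp z_1, \ldots, z_{r-1}, \tau_1, \ldots, \tau_{r-1} .
\end{aligned}$$

This construction is possible since $d \geq 2n$. Replacing successively $u_i$ with $z_i$ and then $z_i$ with $\tau_i$,
$\forall i \in \mathbb{N}_r$, and applying the invariance property, we obtain that

$$\begin{aligned}
\Omega(W) &= \Omega\Big( \sum_{i \in \mathbb{N}_r} \sigma_i\, u_i v_i^\top \Big) \\
&= \Omega\Big( \sigma_1 z_1 v_1^\top + \sigma_2 z_2 v_2^\top + \sum_{i=3}^{r} \sigma_i\, u_i v_i^\top \Big) \\
&= \cdots = \Omega\Big( \sum_{i \in \mathbb{N}_r} \sigma_i\, z_i v_i^\top \Big) \\
&= \Omega\Big( \sigma_1 \tau_1 v_1^\top + \sum_{i=2}^{r} \sigma_i\, z_i v_i^\top \Big) \\
&= \cdots = \Omega\Big( \sum_{i \in \mathbb{N}_r} \sigma_i\, \tau_i v_i^\top \Big) = \Omega\big( Y (W^\top W)^{\frac{1}{2}} \big) .
\end{aligned}$$

Therefore, defining the function $h : \mathbf{S}^n_+ \to \mathbb{R}$ as $h(A) = \Omega(Y A^{\frac{1}{2}})$, we deduce that $\Omega(W) =
h(W^\top W)$.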

Finally, we show that $h$ is matrix nondecreasing, that is, $h(A) \leq h(B)$ if $0 \preceq A \preceq B$. For any
such $A, B$ and since $d \geq 2n$, we may define $W = [A^{\frac{1}{2}}, 0, 0]^\top$, $P = [0, (B - A)^{\frac{1}{2}}, 0]^\top \in \mathbf{M}_{d,n}$.
Then $W^\top P = 0$, $A = W^\top W$, $B = (W + P)^\top (W + P)$ and thus, by hypothesis,

$$h(B) = \Omega(W + P) \geq \Omega(W) = h(A).$$

This completes the proof in one direction of the theorem.

To show the converse, assume that $\Omega(W) = h(W^\top W)$, where the function $h$ is matrix nondecreasing.
Then for any $W, P \in \mathbf{M}_{d,n}$ with $W^\top P = 0$, we have that $(W + P)^\top (W + P) =
W^\top W + P^\top P \succeq W^\top W$ and, so, $\Omega(W + P) \geq \Omega(W)$, as required. ■
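
The block construction of $W$ and $P$ used in the last part of the proof can be reproduced numerically. This is a minimal sketch assuming NumPy; `psd_sqrt`, `build_W_P` and the matrices `A`, `B` are illustrative choices, and the trace norm is used only as one example of a regularizer satisfying (4.8).

```python
import numpy as np

def psd_sqrt(A):
    """Symmetric square root of a positive semidefinite matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def build_W_P(A, B, d):
    """Stack W = [A^{1/2}; 0; 0] and P = [0; (B - A)^{1/2}; 0] as d x n matrices."""
    n = A.shape[0]
    if d < 2 * n:
        raise ValueError("the construction needs d >= 2n")
    W = np.zeros((d, n))
    P = np.zeros((d, n))
    W[:n] = psd_sqrt(A)               # first n rows
    P[n:2 * n] = psd_sqrt(B - A)      # next n rows, disjoint from those of W
    return W, P

def trace_norm(W):
    return float(np.linalg.svd(W, compute_uv=False).sum())

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
B = A + np.diag([1.0, 3.0])           # B - A is positive semidefinite
W, P = build_W_P(A, B, d=5)

print(np.allclose(W.T @ P, 0.0),
      np.allclose(W.T @ W, A),
      np.allclose((W + P).T @ (W + P), B),
      trace_norm(W + P) >= trace_norm(W))
```

Because the nonzero rows of $W$ and $P$ do not overlap, the cross terms vanish and the three identities stated in the proof hold.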

We conclude this section by providing a necessary and sufficient condition for the matrix
nondecreasing property of the function $h$.

Proposition 4.1. Let $h : \mathbf{S}^n_+ \to \mathbb{R}$ be a differentiable function. The following properties are
equivalent:

(a) $h$ is matrix nondecreasing

(b) the matrix $\nabla h(A) := \big( \frac{\partial h}{\partial a_{ij}} : i, j \in \mathbb{N}_n \big)$ is positive semidefinite, for every $A \in \mathbf{S}^n_+$.

Proof. If (a) holds, we choose $x \in \mathbb{R}^n$, $t \in \mathbb{R}$ and note that

$$\frac{h(A + t\, x x^\top) - h(A)}{t} \geq 0 .$$

Letting $t$ go to zero gives that $x^\top \nabla h(A)\, x \geq 0$.

Conversely, if (b) is true we have, for every $x \in \mathbb{R}^n$, that $x^\top \nabla h(A)\, x = \langle \nabla h(A), x x^\top \rangle \geq 0$
and, so, $\langle \nabla h(A), C \rangle \geq 0$ for all $C \in \mathbf{S}^n_+$. For any $A, B \in \mathbf{S}^n_+$ such that $A \preceq B$, consider the
univariate function $g : [0, 1] \to \mathbb{R}$, $g(t) = h(A + t(B - A))$. By the chain rule it is easy to verify
that $g$ is nondecreasing. Therefore we conclude that $h(A) = g(0) \leq g(1) = h(B)$. ■
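
As a simple worked instance of Proposition 4.1, consider the function $h(A) = \operatorname{tr}(MAM)$ for a fixed symmetric matrix $M$, so that $h(W^\top W) = \|WM\|_F^2$. The choice of this particular $h$ is ours, made only for illustration; it also makes the chain-rule step of the proof explicit.

```latex
% Illustrative example (not from the paper): h(A) = tr(M A M), M symmetric.
\begin{align*}
h(A) &= \operatorname{tr}(M A M) = \langle M^2, A \rangle
  &&\Longrightarrow\quad \nabla h(A) = M^2 \succeq 0, \\
g(t) &= h\big(A + t(B - A)\big)
  &&\Longrightarrow\quad g'(t) = \big\langle \nabla h\big(A + t(B - A)\big),\, B - A \big\rangle
     = \operatorname{tr}\big(M (B - A) M\big) \geq 0 .
\end{align*}
```

The last inequality holds whenever $A \preceq B$, since then $M(B - A)M$ is positive semidefinite, so both (a) and (b) are satisfied by this $h$.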



4.3 Examples 

We have briefly mentioned already that functional description (4.9) subsumes the special case of
monotone spectral functions. By spectral functions we simply mean those real-valued functions of
matrices that depend only on the singular values of their argument. Monotonicity in this case simply
means that one-by-one orderings of the singular values are preserved. In addition, the monotonicity
of $h$ in (4.9) is a direct consequence of Weyl's monotonicity theorem [Horn and Johnson,
1985, Cor. 4.3.3], which states that if $A \preceq B$ then the spectra of $A$ and $B$ are ordered.
Interesting examples of such functions are the Schatten $L_p$ norms and prenorms,

$$\Omega(W) = \|W\|_p := \|\sigma(W)\|_p ,$$

where $p \in [0, +\infty)$ and $\sigma(W)$ denotes the $n$-dimensional vector of the singular values of $W$.
For instance, we have already mentioned in Section 4.1 that the representer theorem holds when
the regularizer is the trace norm (the $L_1$ norm of the spectrum). But it also holds for the rank of
a matrix, which is the $L_0$ prenorm of the spectrum. Regularization with the rank is an NP-hard
optimization problem but the representer theorem implies that it can be solved in time dependent
on the total sample size.

If we exclude spectral functions, the functions that remain are invariant under left multiplication
with an orthogonal matrix. Examples of such functions are Schatten norms and prenorms
composed with right matrix scaling,

$$\Omega(W) = \|W M\|_p , \tag{4.18}$$

where $M \in \mathbf{S}^n_+$. In this case, the corresponding $h$ is the function $S \mapsto \|\sqrt{\sigma(MSM)}\|_p$. To
see that this function is matrix nondecreasing, observe that if $A, B \in \mathbf{S}^n_+$ and $A \preceq B$ then
$0 \preceq MAM \preceq MBM$ and hence $\sigma(MAM) \leq \sigma(MBM)$ by Weyl's monotonicity theorem. Therefore,
$\|\sqrt{\sigma(MAM)}\|_p \leq \|\sqrt{\sigma(MBM)}\|_p$.
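
The correspondence between the regularizer (4.18) and its function $h$ can be verified numerically. The following sketch assumes NumPy and restricts itself to $p \geq 1$; the function names, the diagonal scaling matrix and the random data are illustrative assumptions.

```python
import numpy as np

def schatten_scaled(W, M, p):
    """Omega(W) = ||W M||_p, computed from the singular values of W M."""
    s = np.linalg.svd(W @ M, compute_uv=False)
    return float(np.sum(s ** p) ** (1.0 / p))

def h_scaled(S, M, p):
    """h(S) = || sqrt(sigma(M S M)) ||_p, which depends on W only through S = W^T W."""
    lam = np.clip(np.linalg.eigvalsh(M @ S @ M), 0.0, None)
    return float(np.sum(np.sqrt(lam) ** p) ** (1.0 / p))

rng = np.random.default_rng(2)
d, n, p = 6, 3, 1.5
W = rng.standard_normal((d, n))
M = np.diag([1.0, 0.5, 0.0])                       # symmetric PSD; also drops column 3
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # an orthogonal d x d matrix

A = W.T @ W
G = rng.standard_normal((2, n))
B = A + G.T @ G                                    # A <= B in the matrix order

print(schatten_scaled(W, M, p), h_scaled(A, M, p))  # equal up to rounding
print(schatten_scaled(Q @ W, M, p))                 # invariant under left rotation
print(h_scaled(A, M, p) <= h_scaled(B, M, p))       # matrix nondecreasing
```

The diagonal `M` with a zero entry also shows how the scaling can be used to ignore some columns of $W$.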

Also, the matrix $M$ above can be used to select a subset of the columns of $W$. In addition,
more complicated structures can be obtained by summation of matrix nondecreasing functions and
by taking minima or maxima over sets. For example, we can obtain a regularizer such as

$$\Omega(W) = \min \Big\{ \sum_{k \in \mathbb{N}_K} \|W(I_k)\|_1 : \{I_1, \ldots, I_K\} \in \mathcal{P} \Big\} ,$$

where $\mathcal{P}$ is the set of partitions of $\mathbb{N}_n$ in $K$ subsets and $W(I_k)$ denotes the submatrix of $W$ formed
by just the columns indexed by $I_k$. This regularizer is an extension of the trace norm and can
be used for learning multiple tasks via dimensionality reduction on more than one subspace
[Argyriou et al., 2008].

Yet another example of a valid regularizer is that considered in [Evgeniou et al., 2005, Sec. 3.1],
which encourages the tasks to be close to each other, namely

$$\Omega(W) = \sum_{t=1}^{n} \Big\| w_t - \frac{1}{n} \sum_{s=1}^{n} w_s \Big\|^2 .$$

This regularizer immediately verifies property (4.8), and so by Theorem 4.2 it is a matrix
nondecreasing function of $W^\top W$. One can also verify that this regularizer is the square of the form
(4.18) with $p = 2$.
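
The last claim can be checked with the centering matrix $M = I - \frac{1}{n} \mathbf{1}\mathbf{1}^\top$, for which $WM$ subtracts the mean column from every column of $W$. The sketch below assumes NumPy; the function names and the small matrices are illustrative.

```python
import numpy as np

def mean_deviation_regularizer(W):
    """Sum over tasks t of || w_t - (1/n) sum_s w_s ||^2, with tasks as columns of W."""
    centered = W - W.mean(axis=1, keepdims=True)
    return float(np.sum(centered ** 2))

def centering_matrix(n):
    """M = I - (1/n) 1 1^T, a symmetric positive semidefinite projection."""
    return np.eye(n) - np.ones((n, n)) / n

W = np.array([[1.0, 3.0, 5.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 0.0]])                    # d = 3, n = 3
P = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0],
              [1.0, -1.0, 2.0]])                   # W^T P = 0

direct = mean_deviation_regularizer(W)
via_scaling = float(np.linalg.norm(W @ centering_matrix(3), "fro") ** 2)
print(direct, via_scaling, mean_deviation_regularizer(W + P))
```

Here the row means of $W$ are 3, 1 and 0, so the regularizer equals $4 + 0 + 4 + 1 + 1 + 0 = 10$ by either formula, and adding the orthogonal perturbation $P$ can only increase it, as property (4.8) requires.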

Finally, it is worth noting that the representer theorem does not apply to a family of "mixed"
matrix norms that have been used in both statistics and machine learning, in formulations such as
the "group Lasso" [Antoniadis and Fan, 2001, Argyriou et al., 2006, Bakin, 1999, Grandvalet and Canu,
1999, Lin and Zhang, 2003, Obozinski et al., 2006, Yuan and Lin, 2006]. These norms are of the
form

$$\Omega(W) = \|W\|_{p,q} := \Big( \sum_{i \in \mathbb{N}_d} \|w^i\|_p^q \Big)^{\frac{1}{q}} ,$$

where $w^i$ denotes the $i$-th row of $W$ and $(p, q) \neq (2, 2)$. Typically in the literature, $q$ is chosen
equal to one in order to favor sparsity of the coefficient vectors at the same covariates.
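
A small counterexample shows why such norms fall outside the characterization: for $(p, q) = (2, 1)$ property (4.8) can fail. The sketch below assumes NumPy; `mixed_norm` and the two matrices are our own illustrative choices.

```python
import numpy as np

def mixed_norm(W, p, q):
    """||W||_{p,q}: the L_q norm of the vector of L_p norms of the rows of W."""
    row_norms = np.linalg.norm(W, ord=p, axis=1)
    return float(np.linalg.norm(row_norms, ord=q))

W = np.array([[1.0, 0.0],
              [0.5, 0.0]])
P = np.array([[0.25, 0.0],
              [-0.5, 0.0]])                        # W^T P = 0

print(W.T @ P)                                     # the zero matrix
print(mixed_norm(W, 2, 1), mixed_norm(W + P, 2, 1))    # 1.5 and 1.25
print(mixed_norm(W, 2, 2), mixed_norm(W + P, 2, 2))    # Frobenius norm does not decrease
```

Adding the orthogonal perturbation $P$ lowers the $(2,1)$ norm from 1.5 to 1.25, so (4.8) is violated, whereas the case $(p, q) = (2, 2)$, the Frobenius norm, behaves as required.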



5 Conclusion 

We have characterized the classes of vector and matrix regularizers which lead to certain forms 
of the solution of the associated regularization problems. In the vector case, we have proved the 
necessity of a well-known sufficient condition for the "standard representer theorem", which is 
encountered in many learning and statistical estimation problems. In the matrix case, we have 
described a novel class of regularizers which lead to a modified representer theorem. This class, 
which relies upon the notion of matrix nondecreasing function, includes and significantly extends
the vector class. To motivate the need for our study, we have discussed some examples of
regularizers, which have been recently used in the context of multi-task learning and collaborative
filtering.

In the future, it would be valuable to study in more detail special cases of the matrix regularizers
which we have encountered, such as those based on orthogonally invariant functions. It would
also be interesting to investigate how the presence of additional constraints affects the representer
theorem. In particular, we have in mind the possibility that the matrix may be constrained to be in
a convex cone, such as the set of positive semidefinite matrices. Finally, we leave to future studies
the extension of the ideas presented here to the case in which matrices are replaced by operators
between two Hilbert spaces.

Appendix



Here we collect some auxiliary results which are used in the above analysis.
The first result states a basic property of connectedness through rotations.

Lemma 5.1. Let $w, v \in \mathbb{R}^d$ and $d \geq 2$. Then there exists $U \in \mathbf{O}^d$ with determinant 1 such that
$v = Uw$ if and only if $\|w\| = \|v\|$.

Proof. If $v = Uw$ we have that $v^\top v = w^\top w$. Conversely, if $\|w\| = \|v\| \neq 0$ (the case $w = v = 0$
being trivial), we may choose orthonormal vectors $\{x_\ell : \ell \in \mathbb{N}_{d-1}\} \perp w$ and $\{z_\ell : \ell \in \mathbb{N}_{d-1}\} \perp v$ and form the matrices
$R = (w, x_1, \ldots, x_{d-1})$ and $S = (v, z_1, \ldots, z_{d-1})$. We have that $R^\top R = S^\top S$. We wish to solve
the equation $UR = S$. For this purpose we choose $U = S R^{-1}$ and note that $U \in \mathbf{O}^d$ because
$U^\top U = (R^{-1})^\top S^\top S R^{-1} = (R^{-1})^\top R^\top R R^{-1} = I$. Since $d \geq 2$, in the case that $\det(U) = -1$ we
can simply change the sign of one of the $x_\ell$ or $z_\ell$ to get $\det(U) = 1$ as required. ■
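
The construction in this proof translates almost line by line into code. The following is a sketch under the assumptions that NumPy is available and that $w$ and $v$ are nonzero with equal norms; `rotation_mapping` is a hypothetical name, and the orthonormal complements are obtained from a complete QR factorization, which is one convenient choice among many.

```python
import numpy as np

def rotation_mapping(w, v):
    """Return an orthogonal U with det(U) = 1 and U w = v (requires ||w|| = ||v|| > 0, d >= 2)."""
    d = w.shape[0]
    # In a complete QR factorization of a single column, the columns 1..d-1 of the
    # orthogonal factor are orthonormal and orthogonal to that column.
    X = np.linalg.qr(w.reshape(d, 1), mode="complete")[0][:, 1:]
    Z = np.linalg.qr(v.reshape(d, 1), mode="complete")[0][:, 1:]
    R = np.column_stack([w, X])
    S = np.column_stack([v, Z])
    U = S @ np.linalg.inv(R)
    if np.linalg.det(U) < 0:
        # Flip the sign of one of the z vectors, as in the proof, to get determinant +1.
        S[:, -1] *= -1.0
        U = S @ np.linalg.inv(R)
    return U

w = np.array([3.0, 0.0, 4.0])
v = np.array([0.0, 5.0, 0.0])                      # same Euclidean norm as w
U = rotation_mapping(w, v)
print(U @ w, np.linalg.det(U))
```

The returned matrix maps $w$ to $v$, satisfies $U^\top U = I$ and has determinant 1, up to rounding.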

The second result concerns the monotonicity of the trace norm.

Lemma 5.2. Let $W, P \in \mathbf{M}_{d,n}$ such that $W^\top P = 0$. Then $\|W + P\|_1 \geq \|W\|_1$.

Proof. It is known that the square root function, $t \mapsto t^{\frac{1}{2}}$, is matrix monotone (see, for example,
[Bhatia, 1997, Sec. V.1]). This means that for any matrices $A, B \in \mathbf{S}^n_+$, $A \succeq B$ implies $A^{\frac{1}{2}} \succeq B^{\frac{1}{2}}$.
Hence, for any matrices $A, B \in \mathbf{S}^n_+$, $A \succeq B$ implies $\operatorname{tr} A^{\frac{1}{2}} \geq \operatorname{tr} B^{\frac{1}{2}}$. We apply this fact to the
matrices $W^\top W + P^\top P$ and $W^\top W$ to obtain that

$$\|W + P\|_1 = \operatorname{tr}\,((W + P)^\top (W + P))^{\frac{1}{2}} = \operatorname{tr}\,(W^\top W + P^\top P)^{\frac{1}{2}} \geq \operatorname{tr}\,(W^\top W)^{\frac{1}{2}} = \|W\|_1 .$$
■